The original dataset contains RNA-seq count data generated by profiling peripheral blood from 62 COVID-19 patients (COVID-19 group) and 24 healthy controls (HC group) via bulk RNA-seq. It is publicly available on Gene Expression Omnibus database (GEO) under accession number GSE152641[1] as a count matrix with 20460 unique Entrez gene ID’s along the rows and 86 samples (biological replicates) along the columns. The original study not only performed expression analysis on genes of these 86 samples, but also compiled expression data of prior studies on six viruses: influenza, RSV, HRV, Ebola, Dengue, and SARS in an attempt to isolate COVID-19 biomarkers from other viral infections. Here we will only perform an expression analysis for the RNA-seq count matrix provided by this study. The count matrix file was downloaded using GEOquery[2] version 2.55.1. TMM normalisation was applied to the data using edgeR[3][4] version 3.30.0 before exploratory data analysis. During the normalisation process, 6035 genes had expression values that met the criterium of being outliers as per the edgeR protocol[5] and hence were removed. The mapping from Entrez ID against the official gene symbol (HUGO/HGNC) of the genes was done using annotate[6] and database org.Hs.eg.db[7]. The mapping resulted in one duplicate record of gene ZFHX3, to which the gene of reference ID 388289 and gene 463 were both mapped:
| Entrez ID | Gene Symbol | Status |
|---|---|---|
| 388289 | C16orf47 | Replaced with Gene ID: 463, HUGO symbol: ZFHX3 |
Since the expression values for these two ID’s are identical, and ID 388289 is obsolete, I decided to remove the record for ID 388289 from the normalised expression matrix; retaining it could introduce noises in further analysis.
For the exploratory analysis, edgeR[3][4] was also used to produce a multidimensional scaling (MDS) plot, a Biological Coefficient of Variation (BCV) plot, and a mean-variance plot to present evidence of differences between our biological replicates of the two groups (HC and COVID-19).
The MDS plot presents variation amongst samples based on the normalised gene expression. The distances between each pair of samples on the MDS plot suggest their leading logFC. The healthy control (HC) and the patients (COVID-19) samples form separate clusters along the plot dimension 1. There are however still several COVID-19 samples mixed in the HC cluster.
The BCV plot presents the estimated relative variability of expression between biological replicates; it illustrates the association between the biological CV and the average true gene abundance. The common dispersion line suggests all gene expression values vary by close to a BCV value of 0.5 amongst replicates.[8] The tagwise dispersion shows BCV values calculated individually for each gene. We observe that genes with higher true abundance (under the assumption that RNA-seq counts follow a negative binomial distribution) have lower BCV’s than genes with lower abundance. This imples that the higher the true expression level of a gene has, the lower its variation in our samples.
The mean-variance plot presents the modelling of the mean-variance relationship for the normalised expression values, which are split into bins by overall expression level in the model. The grey points represent the pooled genewise variances. The blue points on the mean-variance plot represent the estimated genewise variances. The red crosses represent the pooled genewise variances, while the dark red crosses represent the the average of the pooled variances for each bin of genes plotted against the average expression level of the genes in the bin. The blue line shows the mean-variance relationship for a negative binomial model with common dispersion.[3][4] We obeserve that all types of variances fit well along the negative binomial line.
Note that this result can only be reproduced using Bioconductor 3.11; using other versions of Bioconductor will lead to a different mapping of gene names.
Since we hypothesized that COVID-19 status of samples was the only factor contributing to differential gene expression, we modeled on group (COVID-19 or HC). By fitting this model and the normalised RNA-seq data to a Quasi-Likelihood(QL) negative binomial (NB) generalised log-linear model using edgeR[3][4], the QL dispersion was estimated, which suggests the level at which the COVID-19 status of a sample can account for the sample’s expression levels of the genes. The fitted result was then used to conduct genewise QL F-test for the coefficients of our two groups of samples. Below is the result of the QL F-test showing the number of genes that passed the significant threshold \(p < 0.05\), and genes that passed that the threshold \(0.05\) for Benjamini-Hochberg FDR-correction:
| Number of genes with p-value < 0.05 | Number of genes that pass after correction |
|---|---|
| 8162 | 7260 |
This threshold for evidence of differential expression was chosen because we would like to capture the genes that only have less 5% chance to show such differences in expression between groups if they were non-differentially expressed genes. The false discovery rate (FDR/Benjamini-Hochberg) method was applied because we need to control for the the likelihood of false positive results that would increase by chance with the increasing number of tests performed. The threshold for the corrected p-value (FDR) was \(<0.05\), as genes that have false positive results for fewer than 5% of the significant tests are of interest.
A Volcano plot was chosen over an MA plot to show the amount of differentially expressed genes because a volcano plot can answer such questions as how many genes that passed the correction are up-regulated or are down-regulated, while an MA plot fails to visualise the association between the results of multiple hypotheses testing and differential expression. Note that the p-value used for the volcano plot is the FDR-corrected p-values. Two genes of interest: ACO1 and ATL3 were highlighted. These two genes were identified by the original study to be the most important COVID-19 signature genes. Thair .et al discovered that ACO1 and ATL3 are completely oppositely regulated between COVID-19 patients and non-COVID-19 patients; ACO1 is up-regulated in COVID-19 patients and down-regulated in patients with non-COVID-19 viral infections, while ATL3 is down-regulated in COVID-19 patients and up-regulated in non-COVID-19 patients. The authors also found that ZC3H13 is the most down-regulated genes among the COVID-19 signature genes.[1] The position on our volcano plot reflect that the result of our differential expression analysis by QLM coincides with this observation made by Thair .et al; ACO1 is down-regulated while ATL3 is up-regulated in the healthy control.
The heatmap plotted from the top-hit genes (\(FDR < 0.05\)) from the differential expression analysis exhibits clusters of up-regulated and down-regulated expressions between the two group. Below is a table presenting how many significant differentially expressed genes are up-regulated and how many down-regulated:
| Upregulated top-hit genes | Downregulated top-hit genes |
|---|---|
| 3260 | 4000 |
Three thresholded ORA’s on all the top-hit differentially expressed genes, the up-regulated top-hits, and the down-regulated top-hots from the QLF result, were performed respectively using g:Profiler[9].
Since this is a pathway analysis, we are only interested in data sources for biological pathways. Hence, Reactome, Wikipathways, and KEGG were used as our annotation data sources. We will also include GO:Biological process (GO:BP) as our data source. Although a biological process is different from a pathway; however, pathways are considered to collectively participate in biological processes. Therefore, GO:BP was too included. The databases used by the current release of g:Profiler are Ensembl 102, which is also the current version of GO database, and Ensembl Genomes 49. The currect release of Reactome used is version 75. Wikipathways is a comunity-maintained database and is being constantly updated. The current release of KEGG used is 97.0.
The p-value threshold for the significance of the gene-set enrichment of the Fisher’s exact test calculated by g:Profiler was set to \(0.05\). The numbers of genesets that were returned with this threshold by each query would be presented as a summary below. Benjamini–Hochberg FDR for p-value adjustment method was chosen, and IEA-annotated results (Inferred from Electronic Annotation) were excluded since the human-curated results are desirable for our first analysis.
We first performed a thresholded ORA on all the QLF significant genes. Below is a table summary of its g:Profiler query result:
| Genesets Returned | Significant Genesets | Ambiguous Genes | Failed Genes |
|---|---|---|---|
| 14777 | 865 | 63 | 24 |
An interactive Manhattan plot combining the results of different annotation sources:
As for the detailed query result, however, there were too many pathways to check. Since GO terms are propagated, there might be terms that are too board and not so informative. Hence we limited the term size to fewer than 1000 genes and only examined those that passed p-value threshold for FDR correction.
An ORA was the performed on the up-regulated significant differentially expressed genes:
| Genesets Returned | Significant Genesets | Ambiguous Genes | Failed Genes |
|---|---|---|---|
| 11431 | 460 | 40 | 6 |
An interactive Manhattan plot that summarises the query result of the up-regulated top-hits:
Summaries of the ORA result for the down-regulated genes:
| Genesets Returned | Significant Genesets | Ambiguous Genes | Failed Genes |
|---|---|---|---|
| 12987 | 1054 | 23 | 18 |
The result of the ORA using all the significant differentially expressed genes was compared against the other two using the up-regulated set and the down-regulated set in terms of the pathways returned from each source of annotation data. The number of gene sets returned appears to be proportional to the length of the query gene list. All these three results using three different gene lists reveal different information. In terms of GO biological processes, the result for all differentially expressed genes involve regulation of neutrophil, leukocyte, and myeloid cells. The result for up-regulated genes does not seem to have many informative GO terms except for viral gene expression and viral transcription. The result for the down-regulated genes has terms involve type I interferon and defense response to virus. In terms of REACTOME, pathways involving neutrophil regulation and myeloid leukocyte also appear in the result for all differentially expressed genes. The result for the up-regulated genes does not seem to have many informative terms either. The result for the down-regulated genes also shows signaling pathways for interferon. In terms of wikipathways, the results for all differentially expressed genes shows T-Cell antigen Receptor (TCR) Signaling Pathway and Type II interferon signaling (IFN); pathways involving type I interferon appear too. The result for the up-regulated genes also have pathways involving TCR. The result for the down-regulated genes is similar to the previous two sources, showing pathways involving Type I interferon and Type II interferon. In terms of KEGG, the result for all differentially expressed genes has “Coronavirus disease - COVID-19” has the top hit pathway, expected but not informative; many cancer-relative pathways also appear. The result for the up-regulated genes shows a pathway involving TCR. The KEGG pathways for the down-regulated genes are mostly cancer-related.
The ORA results highly support the conclusions drawn in the original paper. The authors also found pathways such as neutrophil activation, innate immune response, immune response to viral infection, type-I interferon signaling, and cytokine production for 771 upregulated genes they discovered in COVID-19 patients; pathways they found for 1231 down-regulated genes include lymphocyte differentiation and T-cell activation and regulation, which also appear in our ORA result. The authors concluded that T-cell are suppressed while neutrophils are activated as an indicator of host response to COVID-19 represented in the transcriptomic changes.[1]
There have been prior researches on host immune responses to COVID-19 showing findings that support these ORA results. An immune analysis on COVID-19 patients by Hadjadj et al. has shown findings that coincide with our ORA result. Their multiplex gene expression analysis showed an up-regulation of genes involved in type I interferon signaling in COVID-19 patients[10], while pathways involving type I interferon are present in our ORA results for down-regulated genes in healthy control. Anurag et al. observed increased total leukocyte count and differential neutrophil count in patients with severe COVID-19 infection[11], and GO terms leukocyte activation involved in immune response and neutrophil activation involved in immune response are present in the ORA result for all differentially expressed genes. Kalfaoglu B. et al. discovered a unique dynamics of T-cells in severe COVID-19 patients: T-cells become hyperactivated, proliferate and die rapidly before differentiating into Treg, while they have shown that the majority of Treg-type genes are regulated by TCR signaling[12]; this supports the involvement of T-cell signaling pathways in our ORA result.
source("./scripts/diff_expr_qlf.R", chdir = TRUE, local = knitr::knit_global())
BiocManager::install(version = "3.12", ask = FALSE)
Create a non-thresholded gene list ranked by \(-\log_{10}{\mathrm{FDR}} \ \cdot\ \sgn(\log{\fc})\):
if (!file.exists(file.path(getwd(), "data", "covid19_ranked_genelist.rnk"))) {
gene_rank <- data.frame(
genename = entrez_to_gname[rownames(qlf_output_hits$table)],
F_stat = -log(qlf_output_hits$table$FDR,
base = 10) * sign(qlf_output_hits$table$logFC)
)
gene_rank <- gene_rank[order(gene_rank$F_stat), ]
write.table(x = gene_rank,
file = file.path(getwd(),
"data",
"covid19_ranked_genelist.rnk"),
sep = "\t",
row.names = F,
col.names = F,
quote = F)
}
rnk_file <- file.path(getwd(), "data", "covid19_ranked_genelist.rnk")
We defined the rank to be \(-\log_{10}{\mathrm{FDR}} \ \cdot\ \sgn(\log{\fc})\). With the \(-\log_{10}{\mathrm{FDR}}\), the samller the (corrected) p-value for a gene is, the higher the rank of that gene, regardless of whether it is up- or downregulated. The other term \(\sgn(\log{\fc})\) take into account the factor of regulation of gene expression: if it is up-regulated, then it will be ranked from the top; if it is down-regulated, then it will be ranked from the bottom. Thereby we have genes that show the most statistically significant evidence of differential expression at the top for those upregulated, and the bottom of the list for those downregulated, while genes in the middle are least significant.
RCurl is required to download annotation data sets from Bader Lab:
if (! requireNamespace("RCurl", quietly=TRUE)) {
install.packages("RCurl")
}
htmltools is required to render local images in html:
if (! requireNamespace("htmltools", quietly = TRUE))
install.packages("htmltools")
Rcy3[13] is required to automate Cytoscape from R:
if (! requireNamespace("Rcy3", quietly=TRUE)) {
BiocManager::install("RCy3")
}
Download the gene sets for enrichment analysis from Bader Lab:
gmt_url <- "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames <- RCurl::getURL(gmt_url)
tc <- textConnection(filenames)
contents <- readLines(tc)
close(tc)
# get the gmt that has all the pathways and excludes terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx <- gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
contents,
perl = TRUE)
gmt_fname <- unlist(regmatches(contents, rx))
gmt_file_path <- file.path(getwd(), "data", gmt_fname)
if (!file.exists(gmt_file_path)) # Download the latest genesets if not existed
download.file(paste(gmt_url, gmt_fname, sep = ""), destfile = gmt_file_path)
Initialising GSEA parameters:
gsea_jar <- params$gsea_jar
java_ver <- params$java_ver
analysis_name <- params$analysis_name
run_GSEA <- params$runGSEA
gsea_directories <- list.files(path = file.path(getwd(), "data"),
pattern = paste0("^",
analysis_name,
"\\.GseaPreranked"))
if (length(gsea_directories) == 0)
run_GSEA <- TRUE
GSEA_params <- paste("-gmx", gmt_file_path,
"-rnk", rnk_file,
"-collapse", "false",
"-nperm", "1000", # max 1000
"-scoring_scheme", "weighted",
"-rpt_label", analysis_name,
"-plot_top_x", "20",
"-rnd_seed", "202141",
"-set_max", "1000", # same size limit as ORA
"-set_min", "1",
"-zip_report", "false",
"-out", file.path(getwd(), "data"),
"> gsea_output.txt",
sep = " ")
Run GSEA from command line:
if (java_ver == "11" && run_GSEA){
command <- paste("", gsea_jar, "GSEAPreRanked", GSEA_params, sep = " ")
system(command)
} else if (run_GSEA) {
command <- paste("java", "-Xmx1G",
"-cp", gsea_jar, "xtools.gsea.GseaPreranked",
GSEA_params, sep = " ")
system(command)
}
Get the directory containing GSEA output results:
gsea_directories <- list.files(path = file.path(getwd(), "data"),
pattern = paste0("^",
analysis_name,
"\\.GseaPreranked"))
# Get info of GSEA directories
gsea_dir_info <- file.info(file.path(getwd(), "data", gsea_directories))
# Order from latest to oldest
gsea_dir_info <- gsea_dir_info[with(gsea_dir_info,
order(as.POSIXct(mtime),
decreasing = TRUE)), ]
gsea_result_dir <- row.names(gsea_dir_info)[1] # Use the latest GSEA output result
What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.
The command line tool of Gene Set Enrichment Analysis (GSEA)[14], version 4.1.0, was used to perform this non-thresholded gene set enrichment analysis. The enrichment analysis we ran was GSEA pre-ranked, as we have pre-ranked our gene list by their \(\log{\FDR}\) values. The method used to perform the analysis from R was based on the Enrichment Map Protocol[15] by Bader Lab. The geneset used for the analysis was Gene Ontology biological process (GO:BP), updated on March 1st, with inferred from electronic (IEA) annotations excluded, compiled by Bader Lab[16].
The minimum size of genesets for this analysis is 1, and the maximum is 1000; these two parameters are consistent with the thresholded ORA so that the results of these two analyses are comparable. The number of permutations was set to 1000, which is the maximum possible permutations for GSEA, because this will give us the most accurate test statistics (p-values, q-values/FDR’s) possible.
Summarize your enrichment results.
| Statistics | na_pos | na_neg |
|---|---|---|
| up-regulated | 5994 | 11503 |
| FDR < 0.25 | 299 | 1537 |
| nominal p-value < 0.01 | 314 | 914 |
| nominal p-value < 0.05 | 659 | 1950 |
| top geneset associated | TRANSLATIONAL INITIATION GOBP GO:0006413 | HALLMARK_INTERFERON_ALPHA_RESPONSE MSIGDB_C2 HALLMARK_INTERFERON_ALPHA_RESPONSE |
Note that na_pos refers to genes that are up-regulated in healthy control (HC), or synonymously down-regulated in COVID-19 patients; and na_neg refers to genes that are down-regulated in healthy control (HC), or correspondently, up-regulated in COVID-19 patients. This is because of the design of our model matrix: HC = 1 and COVID-19 = 0.
| GSEA Statistics | TRANSLATIONAL INITIATION GOBP GO:0006413 | HALLMARK_INTERFERON_ALPHA_RESPONSE MSIGDB_C2 HALLMARK_INTERFERON_ALPHA_RESPONSE |
|---|---|---|
| Enrichment Score (ES) | 0.7147206 | -0.6752788 |
| Normalized Enrichment Score (NES) | 2.9827669 | -2.4945803 |
| FDR | 0.0 | 0.0 |
| Nominal P Value | 0.0 | 0.0 |
| Leading-edge Genes | 82 | 80 |
| Top Gene | EIF4B: Eukaryotic translation initiation factor 4B | EPSTI1: Epithelial stromal interaction 1 |
How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?
It is not a straightforward comparison qualitatively. For the top geneset of the up-regulated genes and of the down-regulated genes, GSEA and thresholded ORA yielded highly similar results. Both analyses have translational initiation for the up-regulated genes in healthy control as the top geneset. The ORA result for the down-regulated genes in HC has interferon alpha signaling pathway as the top gene set in the query result of Reactome, and the GSEA result has interferon alpha response as its top gene set.
# Cytoscape apps required
cy_app <- c("clustermaker2", # Clustermaker2, version 0.9.5 or higher
"autoannotate", # AutoAnnotate, version 1.2.0 or higher
"wordcloud") # WordCloud, version 3.1.0 or higher
hasEM <- params$hasEM
if (! hasEM) cy_app <- c(cy_app, "EnrichmentMap") # EnrichmentMap 3.1.1 release
# get Cytoscape version
cy_ver <- unlist(
strsplit(
RCy3::cytoscapeVersionInfo()["cytoscapeVersion"], split = "\\."
)
)
if(length(cy_ver) == 3 &&
as.numeric(cy_ver[1] >= 3) &&
as.numeric(cy_ver[2] >= 7)) {
# vector to store install status of each app
install_responses <- vector(mode = "character", length = length(cy_app))
for(i in seq_along(cy_app)){
install_command <- paste0("apps install app=\"",
cy_app[i],
"\"")
install_response <- RCy3::commandsGET(install_command)
install_responses[i] <- install_response
}
install_summary <- data.frame(name = cy_app,
status = install_responses)
knitr::kable(list(install_summary),
booktabs = TRUE,
caption = 'Summary of Automated App Installation'
)
}
|
pvalue_gsea_threshold <- params$pval_thresh
qvalue_gsea_threshold <- params$fdr_thresh
similarity_coe_type <- "COMBINED"
similarity_coe_cutoff <- "0.375" # default cutoff for combined coefficient
cur_model_name <- analysis_name
gsea_results_path <- file.path(gsea_result_dir, "edb")
gsea_results_fname <- file.path(gsea_results_path, "results.edb")
gene_expr_file <- file.path(getwd(), "data", "Covid19_vs_HC_expr.txt")
gsea_rnk_file <- file.path(gsea_results_path,
list.files(gsea_results_path,
pattern=".rnk$"))
Loading helper functions to map docker file paths to the file paths in the local host machine:
source("./utils/docker_to_host.R", local = knitr::knit_global())
Map docker file paths to the actual local host machine:
isDocker <- params$isDocker
if (isDocker) {
isWindows <- params$isWindows # Map to MS-DOS file system
host_machine_dir <- params$host_machine_dir # local host machine directory
if (isWindows) {
gmt_file_path <- docker_to_dos(gmt_file_path, host_machine_dir)
gsea_results_path <- docker_to_dos(gsea_results_path, host_machine_dir)
gsea_results_fname <- docker_to_dos(gsea_results_fname, host_machine_dir)
gsea_rnk_file <- docker_to_dos(gsea_rnk_file, host_machine_dir)
gene_expr_file <- docker_to_dos(gene_expr_file, host_machine_dir)
} else {
isBetaEM <- params$isBetaEM # if using EnrichmentMap 3.3.2 Beta or higher
if (isBetaEM) {
gmt_file_path <- docker_to_em(gmt_file_path)
gsea_results_path <- docker_to_em(gsea_results_path)
gsea_results_fname <- docker_to_em(gsea_results_fname)
gsea_rnk_file <- docker_to_em(gsea_rnk_file)
gene_expr_file <- docker_to_em(gene_expr_file)
} else {
gmt_file_path <- docker_to_unix(gmt_file_path, host_machine_dir)
gsea_results_path <- docker_to_unix(gsea_results_path, host_machine_dir)
gsea_results_fname <- docker_to_unix(gsea_results_fname, host_machine_dir)
gsea_rnk_file <- docker_to_unix(gsea_rnk_file, host_machine_dir)
gene_expr_file <- docker_to_unix(gene_expr_file, host_machine_dir)
}
}
}
cur_network_name <- paste(analysis_name,
pvalue_gsea_threshold,
qvalue_gsea_threshold,
sep = "_")
em_command <- paste('enrichmentmap build',
'analysisType=', '"gsea"',
'networkName=', paste0('"', cur_network_name, '"'),
'gmtFile=', gmt_file_path,
'pvalue=', pvalue_gsea_threshold,
'qvalue=', qvalue_gsea_threshold,
'similaritycutoff=', similarity_coe_cutoff,
'coefficients=', similarity_coe_type,
'ranksDataset1=', gsea_rnk_file,
'enrichmentsDataset1=', gsea_results_fname,
'filterByExpressions=', 'false',
sep = " ")
em_response <- RCy3::commandsGET(em_command)
cur_suid <- 0
if (grepl(pattern = "Failed", em_response)){
message(response)
} else {
cur_suid <- em_response
}
cur_network_names <- RCy3::getNetworkList()
if (cur_network_name %in% cur_network_names) {
cur_network_name <- paste(cur_suid, cur_network_name, sep = "_")
}
em_response <- RCy3::renameNetwork(title = cur_network_name,
network = as.numeric(cur_suid))
screenshot prior to manual layout:
# zoom network view to maximize either height or width of current network window
RCy3::layoutNetwork(layout.name = "force-directed",
network = as.numeric(cur_suid))
RCy3::fitContent()
if (isDocker) {
if (isWindows) {
get_em_win(title = "Initial Screenshot network",
host_dir = host_machine_dir,
network = cur_suid)
} else {
get_em_unix(title = "Initial Screenshot network",
host_dir = host_machine_dir,
network = cur_suid)
}
} else {
get_em_view(title = "Initial Screenshot network",
network = cur_suid)
}
How many nodes and how many edges in the resulting map? What thresholds were used to create this map?
knitr::kable(
data.frame(
Node = RCy3::getNodeCount(network = as.integer(cur_suid)),
Edge = RCy3::getEdgeCount(network = as.integer(cur_suid)),
"p-value threshold" = pvalue_gsea_threshold,
"FDR threshold" = qvalue_gsea_threshold,
"Similarity Coefficient Type" = similarity_coe_type,
"Similarity Coefficient Cut-off" = similarity_coe_cutoff
)
)
| Node | Edge | p.value.threshold | FDR.threshold | Similarity.Coefficient.Type | Similarity.Coefficient.Cut.off |
|---|---|---|---|---|---|
| 1726 | 14516 | 0.05 | 0.25 | COMBINED | 0.375 |
Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.
We annotate the network using AutoAnnotate[17] with the default parameters.
aa_command <- paste("autoannotate", "annotate-clusterBoosted",
"network=", cur_suid,
"clusterAlgorithm=", "MCL",
"edgeWeightColumn=", "EnrichmentMap::similarity_coefficient",
"labelColumn=", "EnrichmentMap::GS_DESCR",
"maxWords=", "3",
"useClusterMaker=", "true",
"createSingletonClusters=", "false")
aa_response <- RCy3::commandsRun(aa_command)
The parameters used for annotating the network:
| Parameter | Value |
|---|---|
| Cluster Source | clusterMaker2 |
| ClusterMaker Algorithm | MCL Cluster |
| Edge Attribute | EnrichmentMap::similarity_coefficient |
| Label Maker | WordCloud: Adjacent Words (default) |
| Max Words Per Label | 3 |
| Word Adjacent Bonus | 8 |
| Normalization Factor | 0.5 |
| Attribute Names | EnrichmentMap::GS_DESCR |
| Display Style | Clustered-Standard |
| Max Words per Cloud | 250 |
| Cluster Cutoff | 1.0 |
| Min Word Occurrence | 1 |
Note that GS_DESCR refers to the gene set description specified in the second column of Bader Lab’s gmt file.
The automation process ends here as the following works require manual arrangement of the network.
Make a publication ready figure - include this figure with proper legends in your notebook.
Where the colour gradient indicates the normalised enrichment score (NES) of a gene set; the darker the red colour the higher the positive NES of a geneset is, and the darker the blue colour the lower the negative NES of a geneset is.
Note that this figure is not created via Cytoscape automation process as the notebook is compiled; it is pre-made and has been manually arranged and annotated.
Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?
The collapsed network has 164 themes. To identify the major biological themes in this network, I manually arranged the themes (collapsed nodes) and grouped them according to their biological similarities (e.g. functions, diseases, etc.). The 8 clusters of themes represent the following major categories of themes present in this network: 1. Host immune response 2. Lipid 3. Ion transport 4. Structural protein regulation 5. Apotosis and proteolysis 6. Nervous system 7. DNA replication 8. Cancer The remaining themes are either unclear or difficult to be categorized. Group 3 is not surprising since one of the COVID-19 signature genes identified by the original study by Thair et al. [1] is ACO1, which is an iron-sulfur protein that regulates ferritin and transferrin. They also identified genes that are commonly studied in cancer in their COVID-19 specific genes; this explains the presence of themes in group 8. Themes in the lipid group are also expected as superfamily members of ATP-binding cassette transporters that facilitate the interaction of immune cells with various classes of lipids are identified in their signature set. The only group of themes that is not introduced in the paper is nervous system (group 7), which could be considered novel in this context.
Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods
As shown in the collapsed theme network above, except for the themes related to nervous system and themes uncategorised, the majority of biological processes present in the enrichment map are consistent with the original paper. The difference between the GSEA result and the thresholded ORA is that, given the same p-value and FDR cut-off (0.05), and the maximum size limit of gene sets (1000), biological processes involving lipid and ion transport present in the significant gene sets of the GSEA result are absent in the significant gene sets of the ORA result, while some genes involved in these processes are identified to be part of the COVID-19 signature genes. This comparison shows how the thresholded ORA could miss relatively weaker signals.
Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?
Apart from the evidence we previously introduced for the thresholded ORA result, and the evidence provided by the original paper regarding the presence of biological themes of ion transport, lipid, and cancer, there indeed exists supporting evidence for the biological themes of the nervous system, which was not mentioned in the original paper. Tremblay ME et al. discovered that SARS-CoV-2 infection leads to loss of physiological functions of the astrocytes and microglia, which contribute to the autoimmunity of central nervous system.[18] This coincides with the presence of theme astrocyte neuroinflammation microglia in group 6:
Specifically, Tremblay ME et al. found that sysmetic infection caused by COVID-19 triggers considerable increase in circulating levels of chemokines and interleukins, which compromise the blood-brain barrier, enter the brain parenchyma and affect astrocytes and microglia.[18] This is relatable to the original study by Thair et al.. ATL3, which is one of the most significant COVID-19 signature gene identified in the study, is found to be down-regulated in COVID-19 patients. ATL3 is a member of the integral membrane GTPases that proper formation of ER tubules relies on. Down-regulated ATL3 leads to delayed cargo exit and coat assembly for budding from the ER, which is necessary for export of chemokines in response to the infection.[1] This suggests a link between astrocytes and microglia damaged by COVID-19 infection and the expression level of the signature gene ATL3.
Since we have biological themes in abundance, we will continue exploring the main network and ignore the dark matters this time.
sig_url <- "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/DrugTargets/"
# list all the files on the server
filenames <- RCurl::getURL(sig_url)
tc <- textConnection(filenames)
contents <- readLines(tc)
close(tc)
# get the gmt that has all the pathways and excludes terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx <- gregexpr("(?<=<a href=\")(.*.DrugBank_approved.*.)(.gmt)(?=\">)",
contents,
perl = TRUE)
sig_fname <- unlist(regmatches(contents, rx))
sig_file_path <- file.path(getwd(), "data", sig_fname)
if (!file.exists(sig_file_path)) # Download the latest genesets if not existed
download.file(paste(sig_url, sig_fname, sep = ""), destfile = sig_file_path)
While there currently exist very few treatment options, the original study by Thair et al.[1] also attempted to repurpose approved drugs that have passed the safety trials through examining the overlap of the host response to the novel COVID-19 and the other six viruses that have been better studied.
One drug repurposed in the original paper is peginterferon alfa-2a. Although pathways of type I interferon activity are present in both of the GSEA and the thresholded ORA results for genes up-regulated in COVID-19 patients, Hadjadj et al.[10] discovered impaired type I interferon activity in severe and critical COVID-19 patients. Hadjadj et al. also noticed this variability in type I IFN responses to infection, suggesting the virus has mechanisms to disable host IFN production; the mechanisms per se and the timing of when the mechanisms are triggered are yet to be investigated. This discrepancy of these two studies could come from the fact that the original study did not take into account the severity of the disease, and the whole blood for RNA-seq was collected from the biological replicates admitted to the hospital within 24 hours. As this drug facilitates type I IFN responses, adding this drug into our signature gene set might hopefully reveal potential pathways that could be related to the mechanisms.
| Drug | Test | test statistics |
|---|---|---|
| Peginterferon alfa-2a | Mann–Whitney (One-sided Less) | 0.0775 |
We choose the left-tail Mann-Whitney test, since our model is relative to HC and stronger type I INF response was found in our COVID-19 samples.
Themes that interact with peginterferon alfa-2a
Pathways / biological processes in which peginterferon alfa-2a is involved:
Only common pathways/processes in the innate immune system show up in the network; this drug does not reveal novel pathways that allow us to investigate the unknown mechanism.
The top drug that turned up in the left-tail Mann-Whitney test is fostamatinib:
| Drug | Test | test statistics |
|---|---|---|
| Fostamatinib | Mann–Whitney (One-sided Less) | 1.1223e-5 |
Themes that interact with fostamatinib: